White Wine Quality
The data was downloaded from the following site: White Wine Quality. This analysis is project 6 of the Data Analyst for Enterprise Nanodegree Program.
## [1] 4898 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
Explore the data:
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
There are 4898 observations and 13 variables, but the X variable is only a row counter for the observations, so it is better to drop it.
Since quality is a rating, we should also store it as an ordinal categorical variable.
Factorize quality:
## Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
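The two preparation steps above could look roughly like this in base R. This is a minimal sketch, not the original code of the report: the data frame name `wine` is an assumption, and the quality column is rebuilt from the quality counts reported later in this report so that the example runs without the CSV file.

```r
# Rebuild the quality column from the reported counts per score
# (row order is arbitrary here; only the counts come from the report).
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))
wine <- data.frame(X = seq_along(quality), quality = quality)

# X only numbers the rows, so drop it.
wine$X <- NULL

# Keep the integer score and add an ordered factor version next to it.
wine$quality.cat <- factor(wine$quality, ordered = TRUE)

str(wine$quality.cat)
```

Keeping both `quality` and `quality.cat` means the numeric column stays available for correlations while the ordered factor is used for grouping.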
Print a statistical summary of the data:
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality quality.cat
## Min. :3.000 3: 20
## 1st Qu.:5.000 4: 163
## Median :6.000 5:1457
## Mean :5.878 6:2198
## 3rd Qu.:6.000 7: 880
## Max. :9.000 8: 175
## 9: 5
The quality rating ranges from 3 to 9. Let us look at how the observations’ quality ratings are distributed.
Frequency of wine quality ratings as a table:
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
The distribution of wine quality looks roughly normal. The lowest quality is 3 and the highest is 9; there are no ratings of 1, 2, or 10. It is better to bucket quality into 3 classes (Low, Average, High).
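A minimal sketch of such a bucketing with `cut()`. The cut points used here (3 to 5 as Low, 6 as Average, 7 to 9 as High) are an assumption for illustration; the report does not state where the boundaries were placed.

```r
# Quality scores rebuilt from the frequency table above.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Assumed cut points: (2,5] = Low, (5,6] = Average, (6,9] = High.
quality.bucket <- cut(quality,
                      breaks = c(2, 5, 6, 9),
                      labels = c("Low", "Average", "High"),
                      ordered_result = TRUE)

table(quality.bucket)
```

With these assumed boundaries the three classes would hold 1640, 2198 and 1060 wines, which is more balanced than the seven raw scores.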
The first thing to look at for an alcoholic beverage is the alcohol percentage.
alcohol
An appropriate level of alcohol enhances the flavor, but too much of it could lower the quality. The median is 10.4%, and the majority of values fall between 9% and 13%.
pH
pH has a narrow range, from 2.72 to 3.82, which means all the wines are acidic.
Fixed acidity
The basic histogram shows that fixed acidity has very few values at the low end (the minimum is 3.8) and a long tail above 10, so I limited the x-axis range. Changing the binwidth also shows more clearly that the majority of fixed acidity values fall between 5.5 and 8.5.
The distribution of fixed acidity is very close to a normal distribution, but there are some outliers in the data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
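One common way to make "outlier" concrete is Tukey's 1.5 × IQR rule. The report does not say which definition was used, so the rule below is an assumption; the quartiles are copied from the summary above, and `flag_outliers` is a hypothetical helper name.

```r
# Quartiles of fixed.acidity, copied from the summary above.
q1 <- 6.3
q3 <- 7.3
iqr <- q3 - q1

# Tukey's rule (assumed definition): values more than 1.5 * IQR
# beyond the quartiles count as outliers.
fences <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
print(fences)

# Hypothetical helper that flags values outside the fences.
flag_outliers <- function(x, lower, upper) x < lower | x > upper

# The reported minimum, median and maximum as a quick check.
flags <- flag_outliers(c(3.8, 6.8, 14.2), fences[["lower"]], fences[["upper"]])
print(flags)
```

Under this rule the fences sit at about 4.8 and 8.8, so both the reported minimum (3.8) and maximum (14.2) would be flagged.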
volatile acidity
volatile acidity (acetic acid - g / dm3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
Adjust the histogram to get a clearer picture:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Citric.acid
The citric acid distribution looks fairly normal, with a median of 0.32 and a mean of 0.3342, and there is an additional peak at around 0.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Residual Sugar
Residual sugar has a wide range between 0.6-65.8g/l while the median is only 5.2g/l. This is because wine producers try to cater to varying consumers’ preference of sweetness. Some people like me favor sweet wines, while others might prefer bone dry.
After adjusting the histogram
chlorides
Most wines have an amount of sodium chloride between 0.025 and 0.06 g/l, with a median of 0.043 g/l. The highest level in this dataset is 0.346 g/l.
Looks like there are some outliers in this distribution.
Sulphate
The sulphates distribution plots seem approximately normal, with a median of 0.4700 and a mean of 0.4898. Again, in the second plot I dropped the top and bottom 1% of sulphates values. These plots seem slightly bimodal or trimodal, but I am not sure: the peaks seem too close together to classify the distribution as bimodal, although the small scale of this data may matter here.
The data seem to be slightly right-skewed, with very few outliers. Let’s narrow the bin sizes.
I transformed the scale to log10 to better visualize the distribution.
Summary of sulphates:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
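The effect of the log10 transformation can be checked with the summary values alone. The sketch below compares how far the maximum and minimum sit from the median, before and after the transformation; `tail_ratio` is a hypothetical helper introduced only for this illustration.

```r
# Five-number summary of sulphates, copied from the output above.
s <- c(min = 0.22, q1 = 0.41, median = 0.47, q3 = 0.55, max = 1.08)

# Ratio of the upper tail length to the lower tail length around the median.
# A value near 1 means the two tails are about equally long.
tail_ratio <- function(v) {
  unname((v["max"] - v["median"]) / (v["median"] - v["min"]))
}

raw_ratio <- tail_ratio(s)         # raw scale
log_ratio <- tail_ratio(log10(s))  # log10 scale

print(c(raw = raw_ratio, log10 = log_ratio))
```

On the raw scale the upper tail is roughly 2.4 times as long as the lower one; on the log10 scale the ratio is close to 1.1, which is why the transformed histogram looks more symmetric.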
density
Most of the density values are between 0.99 and 1.00 g/cm3, but there are some outliers near 1.01 and 1.04. The mean is 0.9940 and the median is 0.9937. The distribution has a longer right tail than left tail, which can be seen more clearly in the first plot.
Summary of density:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Free.sulfur.dioxide
As with most other features, there are many outliers, and we should trim them for a better analysis. First, let’s adjust the binwidth to get deeper insight.
Let’s also look at the summary statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
As with other variables, there are some extremely large values. If the top 1 percentile is omitted:
This time the distribution looks much better and is close to normal. Only a very small amount of the data has extreme values, and the remaining skewness is very low.
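Omitting the top 1 percentile can be done with `quantile()`. The sketch below uses a small synthetic vector (99 ordinary values plus one extreme value equal to the reported maximum of 289) as a stand-in; it is not the real free.sulfur.dioxide column.

```r
# Synthetic stand-in data, NOT the real column: 99 ordinary values
# and one extreme value equal to the reported maximum.
x <- c(1:99, 289)

# Keep only values at or below the 99th percentile.
cutoff <- quantile(x, 0.99)
trimmed <- x[x <= cutoff]

length(trimmed)
max(trimmed)
```

The same two lines (`quantile()` plus a logical subset) apply to any of the long-tailed variables in this report.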
Total.sulfur.dioxide
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The data follow a pattern similar to the previous variables. There are some extremely large values and a few outliers, but most of the data have a bell-shaped, normal-like distribution. If the top 1 percentile is omitted, the distribution below is obtained.
Most of the data values are between 50 and 240.
Both the free sulfur dioxide and total sulfur dioxide distributions look roughly normal. The free sulfur dioxide distribution has a mean of 35.31 and a median of 34.00; the total sulfur dioxide distribution has a mean of 138.4 and a median of 134.0. In both cases the mean is larger than the median, and this difference is more noticeable for total sulfur dioxide. In the total sulfur dioxide distribution, the spread appears wider for values above the mean than for values below it, so the free sulfur dioxide graph looks more normal.
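The comparison of mean and median can be made explicit with a few lines of arithmetic on the reported summaries. Scaling the gap by the interquartile range is an assumption added here so that the two variables, which live on different scales, can be compared directly.

```r
# Summary statistics copied from the outputs above.
stats <- data.frame(
  variable = c("free.sulfur.dioxide", "total.sulfur.dioxide"),
  q1       = c(23, 108),
  median   = c(34, 134),
  mean     = c(35.31, 138.4),
  q3       = c(46, 167)
)

# Gap between mean and median, also expressed relative to the IQR.
stats$gap        <- stats$mean - stats$median
stats$gap_by_iqr <- stats$gap / (stats$q3 - stats$q1)

print(stats[, c("variable", "gap", "gap_by_iqr")])
```

The gap is 1.31 for free sulfur dioxide and 4.4 for total sulfur dioxide, and it stays larger for total sulfur dioxide after dividing by the IQR, which agrees with the visual impression described above.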
4898 observations of wine with 12 variables.
### What is/are the main feature(s) of interest in your dataset?
The main features in the data are quality, alcohol, residual.sugar, and density.
Other features of interest are sulfur dioxide, citric acid, and chlorides.
I created a new quality variable with 3 groups (Low, Average, High).
Citric acid has two unusual peaks which stood out of an otherwise normal distribution.
I did a log transformation on the sulphates distribution because it was skewed, and the transformation allowed better visualization of the data.
First, a correlation matrix to see the relationships between variables:
##
## CORRELATIONS
## ============
## - correlation type: pearson
## - correlations shown only when both variables are numeric
##
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity . -0.023 0.289
## volatile.acidity -0.023 . -0.149
## citric.acid 0.289 -0.149 .
## residual.sugar 0.089 0.064 0.094
## chlorides 0.023 0.071 0.114
## free.sulfur.dioxide -0.049 -0.097 0.094
## total.sulfur.dioxide 0.091 0.089 0.121
## density 0.265 0.027 0.150
## pH -0.426 -0.032 -0.164
## sulphates -0.017 -0.036 0.062
## alcohol -0.121 0.068 -0.076
## quality -0.114 -0.195 -0.009
## quality.cat . . .
## quality.bucket . . .
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.089 0.023 -0.049
## volatile.acidity 0.064 0.071 -0.097
## citric.acid 0.094 0.114 0.094
## residual.sugar . 0.089 0.299
## chlorides 0.089 . 0.101
## free.sulfur.dioxide 0.299 0.101 .
## total.sulfur.dioxide 0.401 0.199 0.616
## density 0.839 0.257 0.294
## pH -0.194 -0.090 -0.001
## sulphates -0.027 0.017 0.059
## alcohol -0.451 -0.360 -0.250
## quality -0.098 -0.210 0.008
## quality.cat . . .
## quality.bucket . . .
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity 0.091 0.265 -0.426 -0.017 -0.121
## volatile.acidity 0.089 0.027 -0.032 -0.036 0.068
## citric.acid 0.121 0.150 -0.164 0.062 -0.076
## residual.sugar 0.401 0.839 -0.194 -0.027 -0.451
## chlorides 0.199 0.257 -0.090 0.017 -0.360
## free.sulfur.dioxide 0.616 0.294 -0.001 0.059 -0.250
## total.sulfur.dioxide . 0.530 0.002 0.135 -0.449
## density 0.530 . -0.094 0.074 -0.780
## pH 0.002 -0.094 . 0.156 0.121
## sulphates 0.135 0.074 0.156 . -0.017
## alcohol -0.449 -0.780 0.121 -0.017 .
## quality -0.175 -0.307 0.099 0.054 0.436
## quality.cat . . . . .
## quality.bucket . . . . .
## quality quality.cat quality.bucket
## fixed.acidity -0.114 . .
## volatile.acidity -0.195 . .
## citric.acid -0.009 . .
## residual.sugar -0.098 . .
## chlorides -0.210 . .
## free.sulfur.dioxide 0.008 . .
## total.sulfur.dioxide -0.175 . .
## density -0.307 . .
## pH 0.099 . .
## sulphates 0.054 . .
## alcohol 0.436 . .
## quality . . .
## quality.cat . . .
## quality.bucket . . .
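A comparable Pearson matrix can be computed with base R's `cor()`. This is only a sketch of the technique, not necessarily the function that produced the output above, and it uses just the first ten rows of four columns as shown in the `str()` output, so its values will differ from the full-data matrix.

```r
# First ten rows of four columns, copied from the str() output earlier.
wine <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pearson correlations between the numeric columns only.
numeric_cols <- sapply(wine, is.numeric)
cor_matrix <- cor(wine[, numeric_cols], method = "pearson")

print(round(cor_matrix, 3))
```

Restricting to numeric columns mirrors the note in the output that correlations are shown only when both variables are numeric, which is why the factor columns quality.cat and quality.bucket show only dots.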
There are strong correlations between free sulfur dioxide, total sulfur dioxide and the constructed variables bound sulfur dioxide and sulfur dioxide ratio.
It also shows interesting relations between:
1 - residual.sugar vs density
2 - and of course I am going to analyze the main feature, quality, against some interesting variables
Density VS residual sugar
Alcohol vs density:
There is a strong negative correlation between density and alcohol (r = -0.780): when the percent of alcohol increases, the density decreases.
Alcohol VS Total.sulfur.dioxide:
There is a moderate negative relationship between alcohol and total sulfur dioxide (r = -0.449). Density, by contrast, has a moderate positive correlation with total sulfur dioxide (r = 0.530).
Free.sulfur.dioxide vs Total.sulfur.dioxide :
We observe a moderate positive correlation between total sulfur dioxide and free sulfur dioxide.
pH VS Fixed.acidity:
This negative correlation (r = -0.426) makes logical sense: as fixed acidity increases, the pH value decreases, meaning the wine becomes more acidic.
Density, Residual.sugar:
There is a strong positive correlation between density and residual sugar (r = 0.839).
Density, Total.sulfur.dioxide:
Quality VS Alcohol :
There is a moderate positive correlation between alcohol and quality. We can see that as the alcohol increases the rating slightly increases as well.
Fitting a linear model to the alcohol and quality scatterplot (note that the call below regresses alcohol on quality):
##
## Call:
## lm(formula = I(alcohol) ~ I(quality), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2986 -0.7882 -0.1382 0.8014 4.1223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.95670 0.10626 65.47 <2e-16 ***
## I(quality) 0.60524 0.01788 33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.108 on 4896 degrees of freedom
## Multiple R-squared: 0.1897, Adjusted R-squared: 0.1896
## F-statistic: 1146 on 1 and 4896 DF, p-value: < 2.2e-16
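The model summary above can be read with a short worked example. The sketch below only reuses the reported coefficients and R-squared; `predict_alcohol` is a hypothetical helper name. Because the model was fit as alcohol ~ quality, it gives the expected alcohol level for a given quality score, not the other way around.

```r
# Coefficients and R-squared copied from the model summary above.
intercept <- 6.95670
slope     <- 0.60524
r_squared <- 0.1897

# Fitted mean alcohol (%) for a given quality score.
predict_alcohol <- function(quality) intercept + slope * quality

fitted_alcohol <- predict_alcohol(c(3, 6, 9))
print(fitted_alcohol)

# With a single predictor, sqrt(R-squared) equals |Pearson r|.
r_from_model <- sqrt(r_squared)
print(round(r_from_model, 3))
```

The fitted values run from about 8.8% alcohol at quality 3 to about 12.4% at quality 9, and the square root of R-squared, about 0.436, matches the alcohol and quality correlation in the matrix. An R-squared of 0.19 also means that quality accounts for only about a fifth of the variance in alcohol.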
Quality VS Sulphates:
There is no meaningful relationship between sulphates and quality (r = 0.054). This wine additive appears to have little impact on quality.
Quality VS pH
There is no clear relationship between pH and quality.
Quality VS Density
There is a small negative relationship between quality and density: as the density increases, the quality tends to decrease.
Quality VS Total.sulfur.dioxide :
There is a small negative relationship between total sulfur dioxide and quality: as the total sulfur dioxide increases, the quality tends to decrease.
Quality VS Free.sulfur.dioxide :
No clear relation between free sulfur dioxide and quality.
Quality VS Chlorides :
There is a small negative correlation between chlorides and quality (r = -0.210): as the amount of salt increases, the quality tends to decrease.
Quality vs Residual.sugar :
There is no meaningful relationship between residual sugar and quality.
quality VS citric.acid :
There is no meaningful relationship between citric acid and quality.
Quality VS Volatile.acidity:
Small negative correlation between volatile acidity and quality.
Quality vs Fixed Acidity :
There is no clear correlation between fixed acidity and quality.
I analyzed the relationships between the variables in this dataset. Quality, which is my main variable, has its two largest correlations with alcohol (0.436) and density (-0.307). Looking more closely at quality.cat and alcohol level, I found that as quality increases from 3 to 5, the alcohol level tends downwards, and as quality increases from 5 to 9, the alcohol level tends upwards.
Density is strongly correlated with residual sugar and alcohol.
There is also a relationship between fixed acidity and pH.
The strongest correlations I found are between other features. There is a strong positive correlation between residual sugar and density: as the amount of sugar increases, the density increases. Another strong relationship was observed between density and alcohol: as the percent of alcohol increases, the density decreases.
In this plot we can see that the average alcohol percent is higher for the wines with higher quality rating.
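The per-rating averages behind such a plot can be computed with `aggregate()`. The data frame below is hypothetical toy data chosen only to show the mechanics; it is not taken from the wine dataset and says nothing about the real averages.

```r
# Hypothetical toy data (NOT the real wine data).
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.8, 10.2, 10.6, 11.3, 11.7)
)

# Mean alcohol within each quality score.
mean_by_quality <- aggregate(alcohol ~ quality, data = toy, FUN = mean)
print(mean_by_quality)
```

Applied to the full dataset, the same call would return one mean alcohol value per quality score from 3 to 9.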
When I add quality_grouped to the alcohol vs residual.sugar plot, I observe that high-quality wines (dark blue points) generally have a high alcohol level.
An important implication of the graph is that high-quality wines generally have a low residual.sugar level (less than 5). However, a low level of sugar does not imply a high-quality wine: there are many low-quality wines that also have low residual.sugar.
I will convert alcohol to bins to be able to plot density, alcohol, residual sugar and quality together and see how they relate to each other:
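Binning a continuous variable can be sketched with `cut()`. The bin edges below (minimum, quartiles and maximum of alcohol from the summary table) are an assumption, since the report does not state which edges were used; the input is the first ten alcohol values shown in the `str()` output.

```r
# First ten alcohol values, copied from the str() output earlier.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Assumed bin edges: min, quartiles and max of alcohol from the summary.
alcohol.bin <- cut(alcohol,
                   breaks = c(8, 9.5, 10.4, 11.4, 14.2),
                   include.lowest = TRUE)

table(alcohol.bin)
```

The resulting factor can then be mapped to color or facets, so that a plot of density against residual sugar shows one group per alcohol bin.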
We observe here a moderate negative relationship between alcohol and total sulfur dioxide, faceted by quality, especially for the wines that are rated above 5.
fixed.acidity and pH, quality.cat
This plot shows pH for each quality in relationship with fixed acidity.
These variables were amongst the most important ones.
Higher rated wines have lower total sulfur dioxide, which is consistent with sulfur dioxide being mostly undetectable at low concentrations.
Higher rated wines have a lower amount of sugar than the other rated categories.
A very interesting relation is shown in this chart. For a given value of residual sugar, density increases as alcohol decreases. This is to some extent due to the fermentation process of winemaking, in which sugar is consumed to generate alcohol. Since alcohol is less dense than water and sugar is more dense than water, this process makes the density of the wine decrease.
The graph shows that there is a negative relationship between volatile acidity and quality. We can also observe that high-quality wines have high alcohol levels. Furthermore, the spread of alcohol levels appears to increase at high volatile acidity.
The negative relationship between alcohol and residual sugar is evident. Although the variance is quite high, the smoothing curve shows the average residual sugar by alcohol. It is interesting to see that residual.sugar decreases significantly as alcohol increases.
In this exploratory data analysis I followed 3 steps: 1 - univariate, 2 - bivariate, and 3 - multivariate analysis.
I examined the variables and the relationships between them. Some interesting relations came up, like the one between alcohol, density, residual sugar and quality, which could be related to the fermentation process of wine. The correlation between pH and fixed acidity, while pH correlates only weakly with volatile acidity and citric acid, is also worth noting.
The main challenge I encountered was that the variables were not clearly explained, as they represent chemical properties.
I think it would be interesting to have a more evenly balanced dataset, with more low- and high-quality wines, to better visualize these trends. In addition, it would be interesting to see the price of each wine too.
Also, my analysis depended on relationships between correlated variables, but there are surely uncorrelated factors, both chemical and psychological, that affect the quality of wine and still need more investigation.